02. Connect datalab

  • datalab은 Google Cloud에서 서비스하는 Jupyter notebook이라고 보면 됩니다
  • 최근 20170818 버전부터 python2, python3 커널을 사용할 수 있습니다 (기존에는 python2만 해당)
  • datalab은 Google Cloud Storage / BigQuery 등에서 IO 속도가 매우 빠릅니다
  • datalab install
  • datalab connect [datalab instance 명]

  • Google Datalab의 모습

In [2]:
import google.datalab.bigquery as bq

In [3]:
# Query 생성
query_string = '''
              #standardSQL
              SELECT corpus AS title, COUNT(*) AS unique_words
              FROM `publicdata.samples.shakespeare`
              GROUP BY title
              ORDER BY unique_words DESC
              LIMIT 10
              '''

In [4]:
query = bq.Query(query_string)

In [5]:
output_options = bq.QueryOutput.table(use_cache=True)
result = query.execute(output_options=output_options).result() # query 실행
result


Out[5]:
titleunique_words
hamlet5318
kinghenryv5104
cymbeline4875
troilusandcressida4795
kinglear4784
kingrichardiii4713
2kinghenryvi4683
coriolanus4653
2kinghenryiv4605
antonyandcleopatra4582

(rows: 10, time: 0.9s, cached, job: job_QQ4lwiAqh3S3oj06MU_ij2BD1egU)

pandas dataframe으로 변환


In [6]:
pandas_df = result.to_dataframe()

In [7]:
pandas_df


Out[7]:
title unique_words
0 hamlet 5318
1 kinghenryv 5104
2 cymbeline 4875
3 troilusandcressida 4795
4 kinglear 4784
5 kingrichardiii 4713
6 2kinghenryvi 4683
7 coriolanus 4653
8 2kinghenryiv 4605
9 antonyandcleopatra 4582

In [8]:
sample_dataset = bq.Dataset('bigquery-public-data.samples')

In [9]:
# dataset이 존재하는지 유무
sample_dataset.exists()


Out[9]:
True

%bq magic command를 이용한 project sample 확인


In [10]:
%bq datasets list --project cloud-datalab-samples


Out[10]:
  • cloud-datalab-samples.appenginelogs
  • cloud-datalab-samples.carprices
  • cloud-datalab-samples.httplogs

%%bq query를 사용해 바로 query문을 날림


In [12]:
%%bq query
#standardSQL
SELECT corpus AS title, COUNT(*) AS unique_words
FROM `publicdata.samples.shakespeare`
GROUP BY title
ORDER BY unique_words DESC
LIMIT 10


Out[12]:
titleunique_words
hamlet5318
kinghenryv5104
cymbeline4875
troilusandcressida4795
kinglear4784
kingrichardiii4713
2kinghenryvi4683
coriolanus4653
2kinghenryiv4605
antonyandcleopatra4582

(rows: 10, time: 1.4s, 2MB processed, job: job_yKN6rIwV8lYcd2URLkKMj_Jv-TNl)

매직커맨드를 활용해 pandas data frame 생성


In [17]:
%%bq query -n requests
SELECT timestamp, latency, endpoint
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Popular' OR endpoint = 'Recent'

In [18]:
df = requests.execute(output_options=bq.QueryOutput.dataframe()).result()
len(df)


Out[18]:
8392

In [19]:
df.head()


Out[19]:
timestamp latency endpoint
0 2014-06-15 14:41:43.465275 256 Recent
1 2014-06-15 17:25:47.865026 256 Recent
2 2014-06-15 19:40:43.414446 256 Recent
3 2014-06-16 01:38:26.398083 256 Recent
4 2014-06-16 03:32:55.519498 256 Recent

google chart api를 활용해 바로 chart 그리기 가능


In [13]:
%%bq query --name data
WITH quantiles AS (
  SELECT APPROX_QUANTILES(LOG10(latency), 50) AS timearray
  FROM `cloud-datalab-samples.httplogs.logs_20140615`
  WHERE latency <> 0
)
select row_number() over(order by time) as percentile, time from quantiles cross join unnest(quantiles.timearray) as time
order by percentile

In [14]:
%chart columns --data data --fields percentile,time


Out[14]: